1. Introduction

This report walks through some basic steps for approaching a dataset in R, from initial data exploration through to regression analyses.

Specifically, it will address:

  • Loading and cleaning data
  • Principal Component Analysis
  • Correspondence Analysis
  • Cluster Analysis
  • Linear Discriminant Analysis
  • Linear Regressions (and Q-Q plots)
  • Logistic Regression
  • Analysis of Variance (ANOVA)

This document assumes that the reader has at least a basic understanding of the R language and some familiarity with elementary statistics.

For most of the code related to data processing and plotting, we will use the tidyverse, which is a collection of well-curated libraries/packages specifically for data science. The tidyverse is fast becoming the industry standard, but all of what will be covered here can also be done in Base R (the core functions that come with R).

Generating this document used R v3.6.1 and tidyverse v1.3.0. Full specifications of the R environment will be printed at the end of this report.

The report itself was constructed in RStudio using R Markdown, a markup language that allows R code to be easily embedded in a formatted text document.

2. Setting up

It is good practice to stay organized by loading all the libraries that the code will require at the start of a project. Here, we will use the following packages:

library(tidyverse)
library(GGally)
library(ggfortify)
library(plotly)
library(caret)
library(factoextra)
library(FactoMineR)

The above code assumes that these packages are already installed on your system and only need to be loaded. If this is not the case, packages can be installed using the following code:

my.packages<-c("tidyverse",
               "GGally",
               "ggfortify",
               "plotly",
               "caret",
               "factoextra",
               "FactoMineR")
install.packages(my.packages, dependencies = T)

Next, it is important to set a working directory, the location on your system where all files related to this project will be housed. Once this is set, it is recommended to use relative paths for any subsequent file management. That way, if the project folder on your system is ever moved, only this working directory needs to be adjusted, not any other paths you may have introduced in your code.

setwd("~/My Cloud/upwork/Cornier/")

On my system, this project is based in ~/My Cloud/upwork/Cornier/, but this will be different for each user.

Note: Mac and Windows format file paths slightly differently. To get an idea of how paths are constructed on your system, run getwd() to show the current directory you are in.
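As a small sketch of the relative-path idea, file.path() builds paths in a way that works on both systems. The data/ subfolder here is a hypothetical example layout, not part of this project:

```r
# with the working directory set, refer to files relative to it
# (the data/ subfolder is a hypothetical example)
my.file <- file.path("data", "2015.csv")
my.file
## [1] "data/2015.csv"
```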

We are now all set to load the file that we want to work with. It should of course be in the working directory we have set above. Seeing as the file is a comma-separated table of data (.csv), R will know what to do if we invoke the following function.

dat<-read_csv("2015.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   Region = col_character(),
##   `Happiness Rank` = col_double(),
##   `Happiness Score` = col_double(),
##   `Standard Error` = col_double(),
##   `Economy (GDP per Capita)` = col_double(),
##   Family = col_double(),
##   `Health (Life Expectancy)` = col_double(),
##   Freedom = col_double(),
##   `Trust (Government Corruption)` = col_double(),
##   Generosity = col_double(),
##   `Dystopia Residual` = col_double()
## )

3. Inspect and explore data

It is always good practice to explore the data. Here, tabulating and plotting are key.

3.1 Data classes

Data come in different forms. It is a good idea to check what format the data columns actually are in. The read_csv() function we used to load the data already does this for us, indicating that 12 columns of data were loaded; most of these are numeric (col_double()), but two of them are character strings (col_character()).
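Another quick way to inspect column types in the tidyverse is glimpse(), which prints one row per column along with its type and the first few values:

```r
# compact overview of the dataset: column names, types and example values
glimpse(dat)
```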

We could inspect the shape of this data object as well to see how many columns and rows the table contains:

dim(dat)
## [1] 158  12

We could also summarize each of the columns like so:

summary(dat)
##    Country             Region          Happiness Rank   Happiness Score
##  Length:158         Length:158         Min.   :  1.00   Min.   :2.839  
##  Class :character   Class :character   1st Qu.: 40.25   1st Qu.:4.526  
##  Mode  :character   Mode  :character   Median : 79.50   Median :5.232  
##                                        Mean   : 79.49   Mean   :5.376  
##                                        3rd Qu.:118.75   3rd Qu.:6.244  
##                                        Max.   :158.00   Max.   :7.587  
##  Standard Error    Economy (GDP per Capita)     Family      
##  Min.   :0.01848   Min.   :0.0000           Min.   :0.0000  
##  1st Qu.:0.03727   1st Qu.:0.5458           1st Qu.:0.8568  
##  Median :0.04394   Median :0.9102           Median :1.0295  
##  Mean   :0.04788   Mean   :0.8461           Mean   :0.9910  
##  3rd Qu.:0.05230   3rd Qu.:1.1584           3rd Qu.:1.2144  
##  Max.   :0.13693   Max.   :1.6904           Max.   :1.4022  
##  Health (Life Expectancy)    Freedom       Trust (Government Corruption)
##  Min.   :0.0000           Min.   :0.0000   Min.   :0.00000              
##  1st Qu.:0.4392           1st Qu.:0.3283   1st Qu.:0.06168              
##  Median :0.6967           Median :0.4355   Median :0.10722              
##  Mean   :0.6303           Mean   :0.4286   Mean   :0.14342              
##  3rd Qu.:0.8110           3rd Qu.:0.5491   3rd Qu.:0.18025              
##  Max.   :1.0252           Max.   :0.6697   Max.   :0.55191              
##    Generosity     Dystopia Residual
##  Min.   :0.0000   Min.   :0.3286   
##  1st Qu.:0.1506   1st Qu.:1.7594   
##  Median :0.2161   Median :2.0954   
##  Mean   :0.2373   Mean   :2.0990   
##  3rd Qu.:0.3099   3rd Qu.:2.4624   
##  Max.   :0.7959   Max.   :3.6021

This gives us basic statistics such as the means for each column of data, and most importantly, it also shows us that there are no missing data (NA’s).
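To check this explicitly, the number of missing values per column can be counted directly; all zeros here would confirm that nothing is missing:

```r
# count the NA's in each column of the dataset
colSums(is.na(dat))
```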

3.2 Missing data

Missing data (NA’s) may be a problem for some analyses. The easiest way to deal with this is to simply remove any row that contains at least one missing entry. This is of course quite blunt and should be done with caution.

dat <- dat %>%
  drop_na()

The drop_na() function can also be used more specifically to only remove rows that have NA’s for specific variables/columns, like so:

dat <- dat %>%
  drop_na(Region)

It may be that missing data is coded in a specific way. For example, the original dataset may contain the written string “not available” which R may not intuitively recognise as missing. To tell R that such values should be treated as missing values (NA’s), this can be specified in read_csv(), by including the argument na=c(“not available”).
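As a sketch, re-reading the file with this argument would look like the following (the extra codes in the na vector are illustrative examples, not values known to be in this file):

```r
# treat empty strings, "NA" and "not available" as missing on import
dat <- read_csv("2015.csv",
                na = c("", "NA", "not available"))
```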

3.3. Selecting and Filtering Data

Among the columns of data may be variables that are not of interest or for some reason pose problems for downstream analyses. To tidy up the dataset, these can be removed using the select() function. By including a -, columns will be deselected:

dat <- dat %>%
  select(-`Standard Error` ,-`Dystopia Residual`,-`Happiness Rank`)

Note: You may notice that some variable names are wrapped in backticks (``). This is because these names contain spaces, and the backticks tell R that these are spaces within a single name, not separators between one variable and the next.

If for any reason you wish to filter out rows of data based on some criterion, this can be done using the filter() function. For example, if any rows with the fictional Country name “Gondor” should be removed from the dataset, this can be done like so:

dat <- dat %>%
  filter(Country!="Gondor")

Here, != means “not equal to”; the opposite is ==, which would instead keep only the rows where the country name is “Gondor”.
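Multiple conditions can also be combined within one filter() call. As an illustrative query (not used further below):

```r
# keep only Western European countries with a Happiness Score above 6
dat %>%
  filter(Region == "Western Europe", `Happiness Score` > 6)
```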

3.4 Plotting variables

By plotting variables we can get an initial idea of what the data look like, whether we are likely to find any interesting patterns (correlations), and whether we are likely to run into problems when fitting models later on. Typical problems include inter-correlation of variables, zero inflation, outliers and deviations from the normal distribution. One handy tool for getting an overview is a pair-plot, which works particularly well for numeric, continuous data:

dat %>%
  select(-Country, -Region) %>%
  ggpairs() +
  theme_minimal()

This plot shows us that we are unlikely to have serious problems with inter-correlation of variables, which would show up as very diagonal arrangements of points in the scatter plots and high correlation coefficients in the upper triangle of the plot matrix. Health and Economy (correlation coefficient of 0.816) is one case where the correlation is indeed something to keep in mind if we were to include both variables in a model.

These plots are also good for detecting clear outliers. This dataset shows no real problematic data points; at most, the Generosity score for Myanmar deserves attention, as it falls outside the otherwise neat cluster of scores for the remaining countries. It might be worth going back to the original dataset to see if this is a mistake. If need be, it could be filtered out as shown above.

We can also see that the Happiness Score is positively correlated with Economy, Health, Family and Freedom, and less so with Generosity for example. We are therefore already getting important insight into potentially interesting relationships.

If the Happiness Score is what we are interested in predicting (i.e. the response variable in our models), then we should also be careful about how it is distributed as many models will assume that the response variable is normally distributed. The distribution curve shown here on the plot is not terribly problematic, but we may want to test this specifically using a statistical test such as a Shapiro-Wilk test. More on this in the Regression Analysis section below.

If the distributions in the diagonal of the plot matrix are multi-modal, it may help to split the plots by a categorical variable. As an exercise I will do this with the Region variable. Keep in mind that there are 10 regions in this dataset, which is perhaps a bit much and may clutter the plot rather than make it more useful.

dat %>%
  select(-Country) %>%
  ggpairs(mapping = aes(colour = Region), columns = 2:8) +
  theme_minimal()

4. Principal Component Analysis

A Principal Component Analysis (PCA) is often included under the umbrella of data exploration as it involves no underlying statistical test. It is a data rotation method that takes a dataset with many variables and reduces these, trying to represent as much of the variation present in the data in as few axes as possible. These new axes/variables are called Principal Components. An additional benefit of a PCA is that the resulting principal components are completely orthogonal: there is no inter-correlation between them, which can be extremely useful for modelling, especially for predictive models.

Here we will perform a PCA on the predictor variables only, which are all continuous numeric and have no missing values.

It is good practice to scale and centre the data, particularly if the different variables are measured in different units. This can be done internally by the PCA function prcomp(), but later on we will also do this manually.

modPCA<-dat %>%
  select(-Country, -Region, -`Happiness Score`) %>%
  prcomp(scale. = T, center = T)

summary(modPCA)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.6961 1.1509 0.8385 0.74827 0.62074 0.38748
## Proportion of Variance 0.4795 0.2208 0.1172 0.09332 0.06422 0.02502
## Cumulative Proportion  0.4795 0.7003 0.8174 0.91076 0.97498 1.00000

The summary() call on the modPCA object shows us the importance of each of the principal components. The Proportion of Variance is particularly useful, as it tells us how much of the variance of the original data is represented in each component. As is the nature of a PCA, this value decreases with each component. To assess how many of the components are useful, we can look for a flattening out on a scree plot:

screeplot(modPCA, type="lines")

Although the pattern is not extremely strong in this case, after the 3rd component, little change in slope is observed in the plot, indicative that additional components will add little extra information.

We might also be interested in what variables contribute most to these first three components, as this will tell us a little bit about which variables are most involved in structuring the data. This information is stored in the rotation table inside the modPCA object:

modPCA$rotation %>%
  as.data.frame() %>%
  select(1:3)
##                                      PC1        PC2         PC3
## Economy (GDP per Capita)      -0.4996342  0.3529158 -0.03381742
## Family                        -0.4507056  0.2198387  0.22329214
## Health (Life Expectancy)      -0.4795306  0.2887749  0.14609250
## Freedom                       -0.4192966 -0.3742016 -0.06599536
## Trust (Government Corruption) -0.3312477 -0.4264630 -0.72188099
## Generosity                    -0.1781737 -0.6494866  0.63418170

The absolute values are what matter here (i.e. negative or positive does not matter). In this case, no single variable contributes dramatically more than any other to the first component (PC1), but we can see that Economy, Health and Family are the top contributors. This is to say that data points, countries in this case, that have largely differing values of PC1 will also have largely differing values for Economy, Family etc. In the same way, countries with strongly differing values along the second component (PC2) have widely differing values for Generosity. Plotting these components as a PCA biplot represents this information graphically, though such plots are limited to two (maximum three) axes at a time:

autoplot(modPCA, x=1, y=2,
         data=dat,
         scale=0,
         size=2,
         loadings=T, loadings.colour="grey",
         loadings.label = TRUE,
         loadings.label.size = 3,
         loadings.label.colour="darkgrey") +
  theme_minimal()

PCAs can be particularly useful for finding groups in data. For example, we could ask if the variation in the data is explained by regional differences by colouring points by Region and adding convex hulls around points belonging to the same region.

# pca biplot using autoplot()
autoplot(modPCA, x=1, y=2,
         data=dat,
         scale=0,
         colour='Region',
         size=2,
         frame = TRUE) +
  theme_bw()

In the same way, we can see how the continuous Happiness Score correlates with variation in the data (I will use the ggplot() function here instead of the above used autoplot() to show that the same plots can be reproduced using different methods):

# pca plot using ggplot()
p<-modPCA$x %>%
    as.data.frame() %>%
    mutate(`Happiness Score`=dat$`Happiness Score`) %>%
    ggplot(aes(x=PC1, y=PC2, color=`Happiness Score`)) +
    geom_point() +
    scale_colour_gradient(low = "yellow", high = "red", na.value = NA) +
    theme_minimal()

p

This last PCA biplot shows that there is a gradient of happiness along the first principal component. As the main contributing variables to this component are Economy, Health and Family, it would be reasonable to expect that we will find these variables are important for explaining differences in happiness of countries in subsequent statistical models.

Just for fun we could also make this plot interactive using the Plotly package. Go ahead and hover over the plot to see which points refer to each country.

modPCA$x %>%
  as.data.frame() %>%
  mutate(`Happiness Score`=dat$`Happiness Score`) %>%
  mutate(Country=dat$Country)%>%
  plot_ly(x=~PC1,
          y=~PC2,
          type="scatter",
          mode="markers",
          color = ~`Happiness Score`,
          hoverinfo="text",
          text = ~Country)

5. Correspondence Analysis

A Correspondence Analysis (CA) can be viewed as an extension of the PCA, but instead of summarizing the variation of continuous (numeric) data, a CA works with categorical data. The key processing step is to construct a contingency (frequency) table from categorical variables. The only two categorical variables in our dataset are Country and Region. They won’t make for the most exciting example, as they are completely nested (a country can only belong to one single region). Let’s create another variable to make for a slightly better example. To keep with the happiness theme of this document, let’s bin the Happiness Score into four categories so that we can use it in the Correspondence Analysis to see whether countries belonging to the same region are similar in their happiness scores.

happy.index<-as.factor(as.numeric(cut(dat$`Happiness Score`, 4)))

The next step is to construct the contingency table:

con.table <- table(dat$Region, happy.index)
head(con.table)
##                                  happy.index
##                                    1  2  3  4
##   Australia and New Zealand        0  0  0  2
##   Central and Eastern Europe       0 14 14  1
##   Eastern Asia                     0  2  4  0
##   Latin America and Caribbean      0  3 10  9
##   Middle East and Northern Africa  1  9  5  5
##   North America                    0  0  0  2

This contingency table is all we need to fit a basic CA.

modCA<-CA(con.table, graph = F)

Similarly to the PCA, we can plot the results of this model as a biplot. This is often called a symmetric plot, here showing the first two dimensions:

fviz_ca_biplot(modCA,
               repel = TRUE)

This plot shows a global overview of the data, with the columns (the happiness categories) as red triangles and the rows (the regions) as blue points. The closeness of the blue points to each other shows how similar the regions are with respect to their happiness scores, and the closeness of the red triangles to each other shows how similar the four happiness categories are with respect to the regions they contain.

6. Clustering

Cluster analyses aim to group data into subgroups in such a way that members of the same group are more similar to each other than to members of other groups. They are superficially similar to ordination methods such as PCA and CA in that they are unsupervised, meaning they have no response variable. In other words, we do not have any specific idea a priori of how the data points should be grouped. This is a very extensive topic and there are many statistical methods as well as R packages for it. Here I will cover two of these: hierarchical and k-means clustering.

6.1 Hierarchical clustering

Hierarchical clustering does not require an a priori number of clusters and can therefore be useful to see how clustered or diffused the data points are.

The first step is to create a distance matrix between data points. The choice of distance method is very important for this analysis and many methods exist. Euclidean distance is perhaps the most widely used and often the default for many programs, so we will use it here as well. Note that Euclidean distance requires numeric data; for datasets mixing numeric and categorical variables, alternatives such as Gower distance exist. For simplicity, we will use just the continuous variables in our dataset, as we did for the PCA. Again, seeing as the variables are in different units, it is good practice to scale them first.

# prepare and scale data
dat.cluster<-dat %>%
    select(-Region, -Country, -`Happiness Score`) %>%
    scale()

# calculate distance matrix
d<-dat.cluster %>%
    dist(method = "euclidean")

We can now fit the clustering algorithm using Ward’s method, which is particularly good at building tight clusters. There are however a good number of other linkage methods, such as UPGMA, which may be more appropriate depending on the data and are worth exploring.

h.cluster <- hclust(d,
                    method="ward.D")

The best way to display the results of such a hierarchical analysis is using a dendrogram.

plot(h.cluster,
     labels=dat$Country,
     cex=0.5)

Countries close together on the dendrogram are countries that are similar in their variable values. One might now want to cut this tree into clusters to define groups of countries that are similar to each other.

One way to do this is to cut the tree (the dendrogram) into a predefined number of clusters. Let’s say five:

plot(h.cluster,
     labels=dat$Country,
     cex=0.5)

rect.hclust(h.cluster, k=5, border="red") 

Alternatively, we could cut the branches of the tree at a specific height (i.e. at a specific distance) and see how many clusters we are left with.

plot(h.cluster,
     labels=dat$Country,
     cex=0.5)
rect.hclust(h.cluster, h=15, border="red") 

To work with the resulting clusters, it may be useful to extract the cluster numbers for each country. This can be achieved with cutree() in the same manner:

country.clusters<-data.frame(Country=dat$Country,
                             cluster=cutree(h.cluster, k=5))

head(country.clusters)
##       Country cluster
## 1 Switzerland       1
## 2     Iceland       1
## 3     Denmark       1
## 4      Norway       1
## 5      Canada       1
## 6     Finland       1

Note: for visually more appealing dendrograms, try out the fviz_dend() function.
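As a sketch of what that might look like, assuming factoextra is loaded as above:

```r
# a visually cleaner dendrogram, cut into the same five clusters
fviz_dend(h.cluster,
          k = 5,
          cex = 0.5,    # label size
          rect = TRUE)  # draw rectangles around the clusters
```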

6.2 K-means clustering

This is perhaps the most popular clustering method. In k-means clustering, all data points are assigned to a predetermined number of clusters in a way that minimizes the distance to their cluster’s centroid. This is achieved through an iterative process, generally handled internally by the appropriate R functions.

Unlike hierarchical clustering, k-means does not need a precomputed distance matrix: it works directly on the (scaled) data and computes the Euclidean distances to the cluster centroids internally. We can therefore reuse the already created dat.cluster object, and we will try to cluster the data into 5 groups.

k.cluster<-kmeans(dat.cluster,
                  centers=5,
                  nstart = 25)

fviz_cluster(k.cluster, data=dat.cluster) +
  theme_bw()

To see which country has been assigned to which cluster, we can use similar code to before:

country.clusters<-data.frame(Country=dat$Country,
                             cluster=k.cluster$cluster)

head(country.clusters)
##       Country cluster
## 1 Switzerland       4
## 2     Iceland       4
## 3     Denmark       4
## 4      Norway       4
## 5      Canada       4
## 6     Finland       4

A potential limitation of k-means clustering is that the user must specify how many clusters to expect, and it is often the case that this is not known. One way to try to figure out the optimum number of clusters is the elbow or Within Sum of Squares (WSS) method. The underlying principle is that multiple k-means models are run with an increasing number of clusters; the point at which adding new clusters no longer appreciably improves the ‘compactness’ of the clusters is where the optimum number has been reached. WSS is the measure of cluster compactness, and plotting the total WSS against the number of clusters should help us find an inflection point (the “elbow”) beyond which additional clusters no longer drastically change the total WSS:

fviz_nbclust(dat.cluster,
             kmeans,
             method = "wss")

In this case, there is no strong elbow, but after about four or five clusters, the plot starts to level out. Four or five is therefore a good number of clusters for this data.
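The same elbow curve can also be sketched by hand, which makes the underlying idea explicit: fit k-means for a range of cluster numbers and record the total WSS each time.

```r
# manual elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(dat.cluster, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total WSS")
```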

7. Linear Discriminant Analysis

In contrast to the above clustering methods, a Linear Discriminant Analysis (LDA) is hypothesis driven: it requires a response variable in the form of an a priori grouping factor, and is therefore supervised. Similar to a PCA, it can be used to reduce the variable space and can be useful for picking out informative predictors in large, multivariate data, but it is much more powerful. An LDA is in fact a machine learning method and is a useful tool for predictive or classification modelling. Although not strictly necessary, it is common practice with these kinds of algorithms to split the data into training and testing subsets so that the model performance can be evaluated. Usually, the training set used to build the model makes up 75-80% of the original dataset, leaving the remaining 20-25% to test how well the model actually performs.

This method requires a categorical, grouping variable. Although the Region variable of the current dataset would lend itself to this, if we want to continue using Happiness Score as our variable of interest, we could re-use the happy.index categorization that we created for the CA analysis above.

head(happy.index)
## [1] 4 4 4 4 4 4
## Levels: 1 2 3 4
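It is worth checking how the countries are spread across the four categories, as strongly imbalanced classes can make classification harder:

```r
# number of countries in each happiness category
table(happy.index)
```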

We can now split the data into training and test sets and use the training data to construct the LDA model. There are multiple R packages that can do this; I prefer the caret package as it is specifically built for machine learning analyses and could easily be expanded beyond an LDA. It is also an extremely flexible framework and allows for pre-processing of the data, for example fitting a PCA first if there is strong inter-correlation of variables. Here we will use the variables directly.

First let’s make the dataset we want to use.

# make lda data set
dat.lda<-dat %>%
  select(-Region,-Country, -`Happiness Score`) %>%
  mutate(happy.index=happy.index) %>%
  as.data.frame()

Next, we have to make the rules for how we want to split the dataset into testing and training data and make those datasets.

# make rule for partitioning the data, here using 75% for training
inTrain<-createDataPartition(y=dat.lda$happy.index,
                             p=0.75,
                             list=F)

# partition dataset into training and testing based on the rules set up.
training<-dat.lda[inTrain,]
testing<-dat.lda[-inTrain,]

Now we can run the LDA model on the training data.

## run LDA
modLDA <- train(data=training,
                happy.index~ .,
                method = "lda")

The final step is now to use the test data that we set aside earlier to check how well the model actually performs. We do this by asking the model to predict the happiness category of the countries in the test data and then comparing how often it got these right or wrong against the actual happiness categories.

predLDA <- predict(modLDA, newdata = testing)
confusionMatrix(testing$happy.index, predLDA)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4
##          1 2 2 1 0
##          2 1 7 6 0
##          3 0 2 4 5
##          4 0 0 2 6
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.3338, 0.6662)
##     No Information Rate : 0.3421          
##     P-Value [Acc > NIR] : 0.03231         
##                                           
##                   Kappa : 0.3084          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.66667   0.6364   0.3077   0.5455
## Specificity           0.91429   0.7407   0.7200   0.9259
## Pos Pred Value        0.40000   0.5000   0.3636   0.7500
## Neg Pred Value        0.96970   0.8333   0.6667   0.8333
## Prevalence            0.07895   0.2895   0.3421   0.2895
## Detection Rate        0.05263   0.1842   0.1053   0.1579
## Detection Prevalence  0.13158   0.3684   0.2895   0.2105
## Balanced Accuracy     0.79048   0.6886   0.5138   0.7357

One of the most important statistics from this evaluation is the Accuracy, which in this case is fairly low (50%). This means that in half the cases, the model would predict the wrong happiness group for a country. With scores such as this, it would be worthwhile trying other, more complex machine learning methods such as boosting or random forest, but it might ultimately mean that more data need to be collected.
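As a sketch, caret makes it straightforward to swap in such a method on exactly the same data split. Here, method = "rf" requests a random forest (this assumes the randomForest package is installed):

```r
# try a random forest on the same training/testing split
modRF <- train(happy.index ~ .,
               data = training,
               method = "rf")

predRF <- predict(modRF, newdata = testing)
confusionMatrix(testing$happy.index, predRF)
```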

Assuming we are happy with the model performance, we could now use it to predict the happiness category of completely new countries. If we are less interested in such predictions and simply want to better understand structure in our data, we can instead determine which variables contributed most to distinguishing between the happiness groups.

plot(varImp(modLDA))

This plot tells us the variable importance for all 4 groups of happiness. For all four groups, Health, Economy and Family are the most important factors determining happiness.

8. Regression analysis and Q-Q plots

8.1 Linear regressions

Regression models are the bread and butter of most statisticians, as they are an intuitive and powerful tool for determining the relationship between two or more variables. They can become extremely extensive and complicated, but let’s stick to the basics. Let’s say we are interested in the relationship between GDP and Happiness, which when plotted looks like this:

dat %>%
  ggplot(aes(x=`Economy (GDP per Capita)`, y=`Happiness Score`)) +
  geom_point() + 
  geom_smooth(method="lm") + 
  theme_bw()

We can clearly see that as the GDP of a country increases, so does Happiness. The blue line represents the regression slope of a linear regression model. To be able to say that this slope is significantly more positive than a flat line (slope = 0), we can calculate the p value of the regression model’s test statistic, the t value.

modLM<-lm(data = dat,
          formula = `Happiness Score` ~ `Economy (GDP per Capita)`)
summary(modLM)
## 
## Call:
## lm(formula = `Happiness Score` ~ `Economy (GDP per Capita)`, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96394 -0.48777 -0.07157  0.55947  1.60705 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.499      0.133   26.30   <2e-16 ***
## `Economy (GDP per Capita)`    2.218      0.142   15.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7174 on 156 degrees of freedom
## Multiple R-squared:  0.6099, Adjusted R-squared:  0.6074 
## F-statistic: 243.9 on 1 and 156 DF,  p-value: < 2.2e-16

Here we can clearly see that the relationship between GDP and Happiness is very significant and positive, as indicated by the positive t-value and a very low p-value.

The model also has a respectable adjusted R-squared, which indicates how well the regression slope represents the data. If the points in the plot were very diffused, a low R-squared would be an indication that perhaps a linear model is not the best fit.

Whereas the p-value has hard (but arbitrary) thresholds of <0.05 or <0.001, above which a relationship is generally not considered significant, interpreting the R-squared is more ambiguous, with values closer to 1 indicating a better fit of the model.

8.2 Residuals and model assumptions

As with any model, linear regressions have underlying assumptions, and if the input data do not meet these, the model can no longer be trusted. One of the most important assumptions is that the residuals of the model are normally distributed (a normally distributed response variable, the Happiness Score in this case, helps but is not strictly required). We already know from the pair-plots in section 3.4 that the Happiness Score deviates slightly from normality, so it is important to check the residual distribution of the model we constructed above. This can be done visually with a Q-Q plot:

qqnorm(modLM$residuals)
qqline(modLM$residuals, col="red")

The points in this plot should ideally line up perfectly on the red line, although it is not unusual for them to stray at the ends. If the points deviate drastically from the line, it is a good idea to test normality explicitly, using a Shapiro-Wilk test for example.

shapiro.test(modLM$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  modLM$residuals
## W = 0.98714, p-value = 0.1546

A p-value below the common thresholds of p<0.05 or p<0.001 would indicate a significant departure from normality. As this is not the case here, we can assume that the model assumption is not violated.

NOTE: Though normality is the most common check that people do, other model assumptions such as independence of the data points and homoscedasticity (constant variance of the residuals) are also important to check.
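Base R offers a convenient set of diagnostic plots for inspecting these assumptions at a glance via `plot()` on a fitted `lm` object. A minimal sketch, using a made-up model (any `lm` object, such as the `modLM` from above, would work the same way):

```r
# synthetic example model, for illustration only
set.seed(42)
df <- data.frame(x = runif(100))
df$y <- 1 + 2 * df$x + rnorm(100, sd = 0.5)
mod <- lm(y ~ x, data = df)

# which = 1: residuals vs fitted values (linearity, homoscedasticity)
# which = 2: Q-Q plot of the residuals (normality)
# which = 3: scale-location plot (another view of homoscedasticity)
plot(mod, which = 1)
plot(mod, which = 2)
plot(mod, which = 3)
```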

8.3 A brief comment on multivariate and non-linear models

It is easy to visually represent the relationship between two variables, but models can of course be constructed using more variables. The possibilities for constructing such models are ample, and here I will only show a very simple example of how multiple predictor variables can be included in a model.

modLM.multi<-lm(data = dat,
                formula= `Happiness Score` ~ `Economy (GDP per Capita)` + Generosity)
summary(modLM.multi)
## 
## Call:
## lm(formula = `Happiness Score` ~ `Economy (GDP per Capita)` + 
##     Generosity, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36245 -0.43946 -0.01671  0.54241  1.58794 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.0898     0.1642  18.816  < 2e-16 ***
## `Economy (GDP per Capita)`   2.2238     0.1359  16.369  < 2e-16 ***
## Generosity                   1.7038     0.4323   3.941 0.000122 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6862 on 155 degrees of freedom
## Multiple R-squared:  0.6454, Adjusted R-squared:  0.6409 
## F-statistic: 141.1 on 2 and 155 DF,  p-value: < 2.2e-16

It may also be the case that a relationship is not linear:

dat %>%
  ggplot(aes(x=`Economy (GDP per Capita)`, y=`Trust (Government Corruption)`)) +
  geom_point() + 
  geom_smooth(method="auto") +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

In such cases, the data may need to be transformed first, or different types of regression models need to be used.
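A common transformation is to take the log of a skewed variable before fitting. The sketch below uses synthetic data (not the happiness data) where the response depends on the log of the predictor, to show how the transformation improves the fit:

```r
# synthetic data where y depends on log(x), for illustration only
set.seed(7)
x <- exp(runif(100, 0, 3))          # right-skewed predictor
y <- 1.5 * log(x) + rnorm(100, sd = 0.3)

mod_raw <- lm(y ~ x)
mod_log <- lm(y ~ log(x))

# the log-transformed model fits substantially better
summary(mod_raw)$r.squared
summary(mod_log)$r.squared
```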

9. Logistic Regression

Linear regression models are a type of generalized linear model or GLM. A logistic regression belongs to the same family and therefore shares many of the same properties. A key difference is that for logistic regressions, the response variable is binary (e.g. yes/no), although more complex variations exist.

Logistic regressions are particularly useful for predicting a binary outcome given a set of measurements, and for finding a reasonable threshold at which this switch is likely to occur. From the plot of GDP versus Trust, we saw that there seems to be an inflexion point in GDP above which the population starts to trust the government. As we do not have a binary response variable in the data, we can create one as an example:

dat$trust<-0
dat$trust[dat$`Trust (Government Corruption)`>0.3]<-1

For this new trust variable, any score above 0.3 we consider as the population trusting the government. We can now fit the model and plot the regression curve:

# fit model
mod.lr<-glm(data=dat,
            formula = as.factor(trust) ~ `Economy (GDP per Capita)`,
            family = binomial)

# plot curve
dat %>%
  ggplot(aes(x=`Economy (GDP per Capita)`, y=trust)) +
  geom_point() + 
  stat_smooth(method="glm",
              method.args=list(family="binomial"),
              se=FALSE) +
  ylab("Trust in Government") +
  theme_bw()

The curve shows the probability that a country with a given GDP trusts its government. For example, if we measured the GDP of a new country and it was 1.0, we would predict only about a 13% probability that there is trust in the government. This prediction can be calculated precisely for as many new countries as needed. Let’s imagine we have collected the GDP of several nations of Tolkien’s Middle-Earth.

middle_earth<-tibble(Country=c("Gondor","The Shire","Rohan","Mordor"),
                         'Economy (GDP per Capita)'=c(3.0,1.0,1.23,1.7))
middle_earth
## # A tibble: 4 x 2
##   Country   `Economy (GDP per Capita)`
##   <chr>                          <dbl>
## 1 Gondor                          3   
## 2 The Shire                       1   
## 3 Rohan                           1.23
## 4 Mordor                          1.7

We can now predict the trust in government for each using our model:

predict(mod.lr,
        newdata = middle_earth,
        type="response")
##         1         2         3         4 
## 0.9978171 0.1322252 0.2767390 0.7152046
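Under the hood, `predict()` with `type = "response"` simply applies the inverse-logit function (`plogis()` in R) to the linear predictor built from the fitted coefficients. A self-contained sketch with synthetic binary data (the variable names `gdp` and `trust` are just placeholders):

```r
# synthetic binary data, for illustration only
set.seed(3)
gdp <- runif(200, 0, 2)
trust <- rbinom(200, 1, plogis(-4 + 4 * gdp))
m <- glm(trust ~ gdp, family = binomial)

# manual prediction for gdp = 1.0: inverse-logit of the linear predictor...
b <- coef(m)
p_manual <- plogis(b[1] + b[2] * 1.0)

# ...matches predict() with type = "response"
p_predict <- predict(m, newdata = data.frame(gdp = 1.0), type = "response")
all.equal(unname(p_manual), unname(p_predict))
```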

10. Analysis of Variance

An Analysis of Variance (ANOVA) can also be considered a GLM and is again a hypothesis-driven analysis, but unlike in a linear or logistic regression, the explanatory variable is a categorical grouping variable (similar to the LDA), while the response is continuous. We can use an ANOVA to see if our data deviate from the null hypothesis that the means of a variable of interest are the same across different groups. For the current dataset, for example, we could ask if the mean of the Happiness Score is the same across the 10 different regions in the dataset. As always, it is good to plot this relationship first. Here, as a boxplot:

dat %>%
  ggplot(aes(x=Region, y=`Happiness Score`, fill=Region)) +
  geom_boxplot() +
  xlab("") +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Just by looking at this plot, it seems that the mean Happiness Score is not the same across regions. To test this statistically, we can run a one-way ANOVA:

modANOVA<-aov(data = dat,
              formula = `Happiness Score` ~ Region)

summary(modANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Region        9 123.68  13.743   24.76 <2e-16 ***
## Residuals   148  82.15   0.555                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calling summary() on the model shows us the relevant test statistic, the F value, along with its associated degrees of freedom and the p-value. As with the linear regression model, a p-value below 0.05 or 0.001 (a stricter threshold) suggests that the data significantly deviate from the null hypothesis. In other words, the means of the Happiness Score are not the same across the Regions in this case.
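Since a one-way ANOVA is just a linear model with a categorical predictor, `aov()` and `lm()` produce the same F statistic; `aov()` mainly changes how the result is summarized. A small sketch with synthetic grouped data (the names `group` and `score` are placeholders):

```r
# synthetic grouped data, for illustration only
set.seed(5)
df <- data.frame(group = rep(c("A", "B", "C"), each = 20),
                 score = rnorm(60, mean = rep(c(1, 2, 3), each = 20)))

mod_aov <- aov(score ~ group, data = df)
mod_lm  <- lm(score ~ group, data = df)

# the same F statistic either way
summary(mod_aov)[[1]]$`F value`[1]
summary(mod_lm)$fstatistic["value"]
```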

If we want to know which regions specifically differ from which others, we can perform a post-hoc test. This is essentially a series of pairwise tests with a multiple-testing correction. The most common post-hoc test for an ANOVA is Tukey’s HSD test:

modANOVA.tuc<-TukeyHSD(modANOVA)

# lets print only the first couple of lines:
head(modANOVA.tuc$Region)
##                                                                diff       lwr
## Central and Eastern Europe-Australia and New Zealand      -1.952069 -3.701903
## Eastern Asia-Australia and New Zealand                    -1.658833 -3.613102
## Latin America and Caribbean-Australia and New Zealand     -1.140318 -2.908021
## Middle East and Northern Africa-Australia and New Zealand -1.878100 -3.653153
## North America-Australia and New Zealand                   -0.012000 -2.405481
## Southeastern Asia-Australia and New Zealand               -1.967556 -3.838626
##                                                                   upr
## Central and Eastern Europe-Australia and New Zealand      -0.20223493
## Eastern Asia-Australia and New Zealand                     0.29543583
## Latin America and Caribbean-Australia and New Zealand      0.62738479
## Middle East and Northern Africa-Australia and New Zealand -0.10304688
## North America-Australia and New Zealand                    2.38148113
## Southeastern Asia-Australia and New Zealand               -0.09648528
##                                                                p adj
## Central and Eastern Europe-Australia and New Zealand      0.01611274
## Eastern Asia-Australia and New Zealand                    0.17275655
## Latin America and Caribbean-Australia and New Zealand     0.54981473
## Middle East and Northern Africa-Australia and New Zealand 0.02883597
## North America-Australia and New Zealand                   1.00000000
## Southeastern Asia-Australia and New Zealand               0.03072284

The adjusted p-value indicates which pairs of Regions are significantly different from each other. To see just the significant pairs, try:

modANOVA.tuc$Region %>%
  as.data.frame() %>%
  select(`p adj`) %>%
  filter(`p adj`<0.001)
##                                                           p adj
## Southern Asia-Australia and New Zealand            5.063747e-04
## Sub-Saharan Africa-Australia and New Zealand       2.647803e-06
## Sub-Saharan Africa-Central and Eastern Europe      2.157968e-07
## Western Europe-Central and Eastern Europe          1.085010e-07
## Sub-Saharan Africa-Eastern Asia                    9.707860e-04
## Southern Asia-Latin America and Caribbean          1.392162e-04
## Sub-Saharan Africa-Latin America and Caribbean     1.170175e-13
## Sub-Saharan Africa-Middle East and Northern Africa 1.046675e-06
## Western Europe-Middle East and Northern Africa     6.799097e-06
## Southern Asia-North America                        5.492478e-04
## Sub-Saharan Africa-North America                   2.944852e-06
## Western Europe-Southeastern Asia                   3.420991e-04
## Western Europe-Southern Asia                       5.567561e-08
## Western Europe-Sub-Saharan Africa                  2.708944e-14

11. Reproducibility

For the purpose of reproducibility, the following packages/libraries and their versions were used. The output also details the system configuration used to run this code.

print(sessionInfo())
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] FactoMineR_2.3   factoextra_1.0.7 caret_6.0-85     lattice_0.20-38 
##  [5] plotly_4.9.1     ggfortify_0.4.11 GGally_1.4.0     forcats_0.4.0   
##  [9] stringr_1.4.0    dplyr_1.0.2      purrr_0.3.3      readr_1.3.1     
## [13] tidyr_1.0.2      tibble_3.0.3     ggplot2_3.2.1    tidyverse_1.3.0 
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1     ggsignif_0.6.0       ellipsis_0.3.0      
##   [4] class_7.3-15         rio_0.5.16           fs_1.3.1            
##   [7] rstudioapi_0.10      ggpubr_0.4.0         farver_2.0.3        
##  [10] ggrepel_0.8.1        prodlim_2019.11.13   fansi_0.4.1         
##  [13] lubridate_1.7.4      xml2_1.2.2           codetools_0.2-16    
##  [16] splines_3.6.1        leaps_3.1            knitr_1.27          
##  [19] jsonlite_1.6.1       pROC_1.16.1          broom_0.7.1         
##  [22] cluster_2.1.0        dbplyr_1.4.2         shiny_1.4.0         
##  [25] compiler_3.6.1       httr_1.4.1           backports_1.1.5     
##  [28] assertthat_0.2.1     Matrix_1.2-18        fastmap_1.0.1       
##  [31] lazyeval_0.2.2       cli_2.0.1            later_1.0.0         
##  [34] htmltools_0.4.0      tools_3.6.1          gtable_0.3.0        
##  [37] glue_1.4.2           reshape2_1.4.3       Rcpp_1.0.4          
##  [40] carData_3.0-3        cellranger_1.1.0     vctrs_0.3.4         
##  [43] nlme_3.1-143         iterators_1.0.12     crosstalk_1.0.0     
##  [46] timeDate_3043.102    gower_0.2.1          xfun_0.12           
##  [49] openxlsx_4.1.4       rvest_0.3.5          mime_0.8            
##  [52] lifecycle_0.2.0      rstatix_0.6.0        MASS_7.3-51.5       
##  [55] scales_1.1.0         ipred_0.9-9          hms_0.5.3           
##  [58] promises_1.1.0       RColorBrewer_1.1-2   curl_4.3            
##  [61] yaml_2.2.1           gridExtra_2.3        rpart_4.1-15        
##  [64] reshape_0.8.8        stringi_1.4.5        foreach_1.4.7       
##  [67] e1071_1.7-3          zip_2.0.4            lava_1.6.6          
##  [70] rlang_0.4.8          pkgconfig_2.0.3      evaluate_0.14       
##  [73] recipes_0.1.9        htmlwidgets_1.5.1    labeling_0.3        
##  [76] tidyselect_1.1.0     plyr_1.8.5           magrittr_1.5        
##  [79] R6_2.4.1             generics_0.0.2       DBI_1.1.0           
##  [82] foreign_0.8-75       pillar_1.4.3         haven_2.2.0         
##  [85] withr_2.1.2          abind_1.4-5          survival_3.1-8      
##  [88] scatterplot3d_0.3-41 nnet_7.3-12          modelr_0.1.5        
##  [91] crayon_1.3.4         car_3.0-6            utf8_1.1.4          
##  [94] rmarkdown_2.1        grid_3.6.1           readxl_1.3.1        
##  [97] data.table_1.12.8    ModelMetrics_1.2.2.1 reprex_0.3.0        
## [100] digest_0.6.25        flashClust_1.01-2    xtable_1.8-4        
## [103] httpuv_1.5.2         stats4_3.6.1         munsell_0.5.0       
## [106] viridisLite_0.3.0

12. Licensing and feedback

This code was written specifically for Virgile Cornier by Christoph Liedtke contracted via UpWork on 11 October 2020. For feedback or comments, please contact hc.liedtke@gmail.com.